Conversation with Annett

Emphasis that this is an ‘internal’ document with messy graphs and choppy code.

I had an in depth meeting with Anne last Thursday – she pointed me where to dive deeper and what to cut. Therefore, there is a slight modification to this document since you saw it a few weeks ago for the first time.

Instead of doing a mix of parametric and non-parametric throughout this document, Anne instructed me to choose one and stick with it. Based on the Shapiro-Wilk Results, a little more than half of the datasets were non-parameric, and the parametric ones where borderline. You will see this shift in this version of the document.

I met with the masters students, Tobi and Simon, approximately 2 weeks ago to push them to ‘narrow’ down what they are focusing on. I am focusing on overall touch frequency to season and match outcomes. Additionally, I have a very primitive section on reciprocity and social networks. I hope Tobi and simon can focus more on the touch actions, toucher/touchee relationships, and haptic rituals.

For your knowledge (based on my understanding):

Simon is interested in the sub-groups and their behaviors on the field. For example, how does the defensive unit interact with one another and the goal keeper? What about the scorer and assistor?

Tobias is very interested in the contextual factors of the match associated with touch. He told me he was persuing the Comeback Hypothesis! Yay. I think that one will be interesting.

Executive Summary

This report presents an initial analysis of pro-social touch behaviors in the NWSL 2024 season, exploring how patterns of touch between teammates may be associated with team performance and match outcomes. Using a dataset of ~14,000 touch instances (filtered to ~7,000 pro-social touches), I tested four primary hypotheses:

  1. Overall Touch & Team Success
    Hypothesis: Teams with higher frequencies of pro-social touch across the season would achieve better final standings.
    Finding: No strong association was observed between total pro-social touch frequency over the season and final standings (counter to prior findings in basketball). Further work could explore alternate modeling approaches (e.g., controlling for goal celebrations).

  2. Match-Level Touch Variation & Outcomes
    Hypothesis: Matches where teams exceeded their typical touch behavior would be associated with better match outcomes.
    Finding: A significant positive correlation was observed — matches where teams displayed higher-than-usual prosocial touch behavior were associated with more favorable goal differentials.

  3. Underdog Hypothesis
    Hypothesis: Higher levels of prosocial touch would be particularly associated with better outcomes for underdog teams facing stronger opponents.
    Finding: No synergistic relationship between touch and underdog status was observed. Touch seems to help teams regardless of their rank. It has an addictive effect, not a synergistic one. Underdogs do not benefit more fromtouch than favorites, but they still do benefit. I created confusion in the section on my last edition of this report. I had been running the analysis on 2/3s of the data waiting for Simon’s collection to be finished. This was the only section with a significant difference in outcomes.

  4. Social Network & Reciprocity
    Hypothesis: Teams with a higher ratio of reciprocal (vs. non-reciprocal) touches, and more evenly distributed touch behaviors across players, would finish higher in the standings.
    Finding: A higher reciprocal-to-non-reciprocal touch ratio was significantly associated with better standings. Measures of overall touch distribution across players (CV, Gini) were not significantly associated with final standings.

Inter-Rater Reliability:
An Intraclass Correlation Coefficient (ICC = 0.79) indicates good agreement between coders, providing confidence in the reliability of the dataset. Additionally looked at the ICC fore reciprocity (Reciprocal with ICC = 0.82 and non-reciprocal with ICC = 0.68).

Overall Takeaway:
While total season-long touch frequency alone was not associated with team success, the patterns and contexts of prosocial touch appear meaningfully associated with match outcomes. These early results suggest that prosocial touch behavior may serve as a useful marker of team cohesion and resilience in professional sport, warranting further study.

Next Steps / deeper dive Anne emphasized showing graphs of ‘raw data’. Especially interested in team by team breakdown. Therefore, many of these graphs you see will be blown-out from a global all-team to an additional 14 graphs (one for each team).

Anne and I decided where to dive deeper next based on how the touch frequency rate chagnes throughout a match. Tehre is a steady increase in touch frequency as the game progress. Therefore, we think it would be interesting to see if we can find a catious relationship between touch and winning. Basically asking the following questions: Is touch assocaited with winning or is winning associated with touch? I would compare teh baseline rates of touch a team (or perhaps the overall league) exhibits during that time typically in a match to the touch frequency leading up to the first goal scored. See the handwritten paper.

Introduction for this RMarkdown

Hi team (Anne and Annett)! This is my first true stab at the data. I addressed 4 different hypotheses, each with their own sections. The first hypothesis does not seem to have a ‘positive’ outcome however the last three show some promise. With full intent to bias you, the underdog hypothesis (number 3) is my favorite.

I will step through each of the hypotheses in order (as seen in the executive summary above), with their corresponding R-code, outputs, and a quick discussion.

At the end, I have a quick mention to the Inter-Rater analysis. I had hoped to find a way to automate an event by event comparison, however the task seems to beating me. I am currently only doing ICC for sheer frequency of touches across each inter-rater match.

Additionally, here is a link to the git-hub repository containing my code and the raw data spreadsheets if you would like to take a look. I may have to give you permission to access them, apologies for the extra step: (https://github.com/paigeliebel/Soccer_Touch.git)

For readability, the code required for cleaning, merging, and fixing typos in the data frames was excluded from this report.

Hypothesis 1: Touch as a Metric for End of Season Success

Core Hypothesis: Pro-social touches between teammates may serve as an indicator for overall team cohesion. Therefore, we propose that teams with a greater frequency of these interactions will secure higher positions in the season’s final standings.

Definitions:

Pro-social touches are defined as all haptic rituals recorded excluding: Tactical Adjustments, Collisions, and Negative Touch

Situations excluded from analysis for this hypothesis: Goals For/Against, Substitutions

# Core_Hypothesis
# Added Outlier Analysis as well
# For more information on these data frames please look at the README.md file

library(tidyverse)
library(data.table)
library(broom)
library(janitor)
library(readxl)
library(rmarkdown)
library(readr)
library (dplyr)

source("Data_Management.R") #Runs and brings in data frames from Data_Management.R script

#Ensure to use the correct dfs. Touches_final and Matches_final are correct. They only include assigned rater data, no repeat matches

#Check to make sure data frames are loaded:
if (!exists("Touches_final") | !exists("Matches_final") | !exists("FinalStandings")) {
  stop("Touches_final, Matches_final, or FinalStandings not loaded. Check Data_Management.R.")
}

############################ Create Data Set for this Hypothesis ############################ 

# Prosocial touches are defined as all haptic rituals recorded excluding: 
#   Tactical Adjustments, Collisions, and Negative Touch
# 
# Situations excluded from analysis for this hypothesis: Goals For/Against, Substitutions

Exclude_Touch <- c("TA", "CO", "NEG")
Exclude_Situation <- c("GF", "GA", "SUB") 
#Exclude_Visibility <- c("P")

#Creates data set for core hypothesis analysis
Touches_CoreHyp <- Touches_final %>%
  filter(!(HapticRitual %in% Exclude_Touch)) %>%
  filter(!(Situation %in% Exclude_Situation)) #%>% 
  #filter(!(Visibility %in% Exclude_Visibility))

#Count of frequency of touches per team
Touches_by_team <- Touches_CoreHyp %>%
  mutate(Team = str_trim(as.character(Team))) %>%
  count(Team, name = "TotalTouches")

# Make sure TeamID is padded to match
FinalStandings <- FinalStandings %>%
  mutate(TeamID = str_pad(as.character(TeamID), width = 2, pad = "0"))

#Join touch counts with final standings
Team_Touches_Standings <- FinalStandings %>%
  left_join(Touches_by_team, by = c("TeamID" = "Team")) %>%
  filter(!is.na(TotalTouches))

#Plot with regression line : final rankings to frequency of touch
TouchFreq_vs_FinalStandings <- ggplot(Team_Touches_Standings, aes(x = Rank, y = TotalTouches)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "blue", linewidth = 1) +
  scale_x_reverse() +
  labs(
    title = "Final Rank vs Overall Touch Frequency",
    x = "Final Season Rank",
    y = "Total Touches (Filtered)"
  ) +
  theme_minimal()

TouchFreq_vs_FinalStandings_Stats <- cor_result <- cor.test(Team_Touches_Standings$TotalTouches, Team_Touches_Standings$Rank)

#Bar graph of Total Touches per Team over the season
Touches_per_Team <- ggplot(Team_Touches_Standings, aes(x = reorder(TeamID, -TotalTouches), y = TotalTouches)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Total Touches per Team (Season)",
    x = "Team",
    y = "Total Touches"
  ) +
  theme_minimal()

############################ Investigating outliers ############################ 

# Check distribution and identify potential outlier
Team_Touches_Standings %>%
  arrange(desc(TotalTouches))
## # A tibble: 14 × 12
##     Rank Team              TeamID Points MatchesPlayes SeasonWLD HomeWLD AwayWLD
##    <dbl> <chr>             <chr>   <dbl>         <dbl> <chr>     <chr>   <chr>  
##  1    11 Utah Royals FC    13         25            26 7-15-4    4-7-2   3-8-2  
##  2     7 Bay FC            02         34            26 11-14-1   5-8-0   6-6-1  
##  3     2 Washington Spirit 14         56            26 18-6-2    10-2-1  8-4-1  
##  4    12 Angel City FC     01         24            26 7-13-6    3-7-3   4-6-3  
##  5     1 Orlando Pride     09         60            26 18-2-6    10-0-3  8-2-3  
##  6    13 Seattle Reign FC  08         23            26 6-15-5    4-5-4   2-10-1 
##  7     4 Kansas City Curr… 05         55            26 16-3-7    9-1-3   7-2-4  
##  8     5 North Carolina C… 07         39            26 12-11-3   9-1-3   3-10-0 
##  9     6 Portland Thorns … 10         34            26 10-12-4   7-5-1   3-7-3  
## 10    14 Houston Dash      04         20            26 5-16-5    2-7-4   3-9-1  
## 11     3 NJ/NY Gotham FC   06         56            26 17-4-5    9-1-3   8-3-2  
## 12     8 Chicago Red Stars 03         32            26 10-14-2   4-9-0   6-5-2  
## 13     9 Racing Louisvill… 11         28            26 7-12-7    6-5-2   1-7-5  
## 14    10 San Diego Wave FC 12         25            26 6-13-7    5-5-3   1-8-4  
## # ℹ 4 more variables: SeasonGoalsFor <dbl>, SeasonGoalsAgainst <dbl>,
## #   GoalDifferential <dbl>, TotalTouches <int>
library(ggrepel)

outliers <- ggplot(Team_Touches_Standings, aes(x = Rank, y = TotalTouches, label = TeamID)) +
  geom_point(size = 3) +
  geom_text_repel() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  scale_x_reverse() +
  labs(
    title = "Final Rank vs Touch Frequency (w/ Labels)",
    x = "Final Season Rank",
    y = "Total Touches"
  ) +
  theme_minimal()

Touches_per_game <- Touches_CoreHyp %>%
  group_by(Team, SeasonMatchNumber) %>%
  summarise(TouchCount = n(), .groups = "drop")

# Prep: Join team rank for ordering
Touches_per_game_ranked <- Touches_per_game %>%
  mutate(Team = str_pad(as.character(Team), width = 2, pad = "0")) %>%
  left_join(FinalStandings %>% select(TeamID, Rank), by = c("Team" = "TeamID")) %>%
  filter(!is.na(Rank))

ggplot(Touches_per_game, aes(x = TouchCount)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Touches per Game (All Teams)",
    x = "Touches per Game",
    y = "Number of Matches"
  ) +
  theme_minimal()

# Replace "xx" with the team you're investigating
team_focus <- "13"

Touches_per_game %>%
  mutate(IsTarget = ifelse(Team == team_focus, "Target Team", "Others")) %>%
  ggplot(aes(x = TouchCount, fill = IsTarget)) +
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.7, color = "white") +
  scale_fill_manual(values = c("Target Team" = "red", "Others" = "gray")) +
  labs(
    title = paste("Touches per Game Distribution — Highlighting Team", team_focus),
    x = "Touches per Game",
    y = "Number of Matches",
    fill = "Team"
  ) +
  theme_minimal()

#Histogram per team
ggplot(Touches_per_game, aes(x = TouchCount)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  facet_wrap(~ Team, ncol = 4) +
  labs(
    title = "Touches per Game Distribution by Team",
    x = "Touches per Game",
    y = "Number of Matches"
  ) +
  theme_minimal()

#Density plot by team
ggplot(Touches_per_game, aes(x = TouchCount, color = Team)) +
  geom_density(size = 1, alpha = 0.7) +
  labs(
    title = "Touches per Game: Density by Team",
    x = "Touches per Game",
    y = "Density",
    color = "Team"
  ) +
  theme_minimal()

#RidgePlot
# Requires ggridges
library(ggridges)

ggplot(Touches_per_game, aes(x = TouchCount, y = reorder(Team, TouchCount, median), fill = Team)) +
  geom_density_ridges(scale = 2, alpha = 0.8, color = "white") +
  labs(
    title = "Touches per Game Distribution by Team",
    x = "Touches per Game",
    y = "Team (Sorted by Median Touches)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

############################ Within-Team Variability in Touch Frequency ############################ 

# Looks at the variability a team has across matches throughout the season

# Count touches per team per game from CoreHyp data frame
Touches_per_game <- Touches_CoreHyp %>%
  group_by(Team, SeasonMatchNumber) %>%
  summarise(TouchCount = n(), .groups = "drop")

# Computing Within-Team Variability
Team_touch_variability <- Touches_per_game %>%
  group_by(Team) %>%
  summarise(
    MeanTouches = mean(TouchCount),
    SDTouches = sd(TouchCount),
    MinTouches = min(TouchCount),
    MaxTouches = max(TouchCount),
    NumGames = n(),
    .groups = "drop"
  )

# Join season rank to each team for ordering in the plot
Touches_per_game_ranked <- Touches_per_game %>%
  mutate(Team = str_pad(as.character(Team), width = 2, pad = "0")) %>%
  left_join(FinalStandings %>% select(TeamID, Rank), by = c("Team" = "TeamID")) %>%
  filter(!is.na(Rank))  # make sure we only include ranked teams

# Visualize 
TouchesPerGame_vs_rank <- ggplot(Touches_per_game_ranked, aes(x = Rank, y = TouchCount, group = Rank)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  scale_x_reverse(breaks = 1:14) +  # clean 1–14 axis
  labs(
    title = "Variation of Within-Team Touch Frequency per Game",
    x = "Team (Ordered by Final Rank)",
    y = "Touches per Game"
  ) +
  theme_minimal()

# Scale touch based on distance from mean (MAD-based z-score)
# How extreme a touch count is compared to team's norm
Touches_scaled <- Touches_per_game %>%
  group_by(Team) %>%
  mutate(
    MedianTouch = median(TouchCount),
    MAD = mad(TouchCount),  # median absolute deviation
    ScaledTouch = (TouchCount - MedianTouch) / MAD
  ) %>%
  ungroup()

# Join Touches_scaled with ranks
Touches_scaled_ranked <- Touches_scaled %>%
  mutate(Team = str_pad(as.character(Team), width = 2, pad = "0")) %>%
  left_join(FinalStandings %>% select(TeamID, Rank), by = c("Team" = "TeamID")) %>%
  filter(!is.na(Rank))

# Plot of MAD
MAD_TouchesPerGame_vs_rank <- ggplot(Touches_scaled_ranked, aes(x = Rank, y = ScaledTouch, group = Rank)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  geom_hline(yintercept = c(-2, 2), linetype = "dashed", color = "red") +
  scale_x_reverse(breaks = 1:14) +  # clean 1–14 axis
  labs(
    title = "Scaled Touch Deviation from Team Median",
    subtitle = "Boxplot of (TouchCount - Median) / MAD per Team",
    x = "Team (Ordered by Final Rank)",
    y = "Scaled Touch Value (MAD Units)"
  ) +
  theme_minimal()

############################ Within-Team Variability in Touch Frequency vs Ranking ############################ 

# Join variability data to final standings
Variability_vs_Rank <- Team_touch_variability %>%
  mutate(Team = str_pad(as.character(Team), width = 2, pad = "0")) %>%
  left_join(FinalStandings %>% select(TeamID, Rank), by = c("Team" = "TeamID")) %>%
  filter(!is.na(Rank))

# Plot SDTouches vs Rank
Within_Variability_vs_Rank <- ggplot(Variability_vs_Rank, aes(x = Rank, y = SDTouches)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "blue", linewidth = 1) +
  scale_x_reverse(breaks = 1:14) +
  labs(
    title = "Team Variability in Touch vs Final Rank",
    x = "Final Season Rank",
    y = "Touch Frequency Variability (SD)"
  ) +
  theme_minimal()

Within_Variability_vs_Rank_Stats <- cor.test(Variability_vs_Rank$SDTouches, Variability_vs_Rank$Rank)

#In regards to within-tea variability including MAD normalization

Team_scaled_variability <- Touches_scaled %>%
  group_by(Team) %>%
  summarise(
    SD_ScaledTouch = sd(ScaledTouch),
    NumGames = n(),
    .groups = "drop"
  )

Team_scaled_variability_ranked <- Team_scaled_variability %>%
  mutate(Team = str_pad(as.character(Team), width = 2, pad = "0")) %>%
  left_join(FinalStandings %>% select(TeamID, Rank), by = c("Team" = "TeamID")) %>%
  filter(!is.na(Rank))

Team_scaled_variability_ranked_plot <- ggplot(Team_scaled_variability_ranked, aes(x = Rank, y = SD_ScaledTouch)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  scale_x_reverse(breaks = 1:14) +
  labs(
    title = "Team Variability (SD of Scaled Touch) vs Final Rank",
    x = "Final Season Rank",
    y = "SD of Scaled Touch (MAD units)"
  ) +
  theme_minimal()

ScaledTouch_Variability_vs_Rank_Stats <- cor.test(
  Team_scaled_variability_ranked$SD_ScaledTouch,
  Team_scaled_variability_ranked$Rank
)

############################ Investigating data after removing outliers ############################ 

Touches_filteredoutliers <- Touches_scaled %>%
  filter(abs(ScaledTouch) <= 2)

Touches_outliers_removed <- Touches_scaled %>%
  filter(abs(ScaledTouch) > 2)

#Yes, the filtering included both high and low outliers — anything more than 2 MAD units away from the team median, in either direction (too high or too low), was removed.
library(patchwork)

# Full data
p1 <- ggplot(Touches_scaled, aes(x = TouchCount)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(
    title = "Touches per Game (All Games)",
    x = "Touches per Game",
    y = "Number of Matches"
  ) +
  theme_minimal()

# Filtered data
p2 <- ggplot(Touches_filteredoutliers, aes(x = TouchCount)) +
  geom_histogram(binwidth = 5, fill = "darkgreen", color = "white") +
  labs(
    title = "Touches per Game (Outliers Removed)",
    x = "Touches per Game",
    y = "Number of Matches"
  ) +
  theme_minimal()

# Display side-by-side
p1 + p2

############################ Removal of Outliers | Core Hypothesis ############################ 

Touches_filteredoutliers  # contains Team, SeasonMatchNumber, TouchCount
## # A tibble: 344 × 6
##    Team  SeasonMatchNumber TouchCount MedianTouch   MAD ScaledTouch
##    <chr> <chr>                  <int>       <dbl> <dbl>       <dbl>
##  1 01    105                       12        23.5  9.64      -1.19 
##  2 01    109                       17        23.5  9.64      -0.674
##  3 01    116                       35        23.5  9.64       1.19 
##  4 01    127                       27        23.5  9.64       0.363
##  5 01    137                       17        23.5  9.64      -0.674
##  6 01    147                       17        23.5  9.64      -0.674
##  7 01    148                       13        23.5  9.64      -1.09 
##  8 01    156                       25        23.5  9.64       0.156
##  9 01    16                        18        23.5  9.64      -0.571
## 10 01    164                       18        23.5  9.64      -0.571
## # ℹ 334 more rows
# Get only the core hypothesis touches from the non-outlier matches
Touches_CoreHyp_Clean <- Touches_CoreHyp %>%
  semi_join(Touches_filteredoutliers, by = c("Team", "SeasonMatchNumber"))

Touches_by_team_clean <- Touches_CoreHyp_Clean %>%
  mutate(Team = str_trim(as.character(Team))) %>%
  count(Team, name = "TotalTouches")

Team_Touches_Standings_clean <- FinalStandings %>%
  left_join(Touches_by_team_clean, by = c("TeamID" = "Team")) %>%
  filter(!is.na(TotalTouches))

TouchFreq_vs_FinalStandings_clean <- ggplot(Team_Touches_Standings_clean, aes(x = Rank, y = TotalTouches)) +
  geom_point(size = 3, color = "darkgreen") +
  geom_smooth(method = "lm", se = FALSE, color = "darkgreen", linewidth = 1) +
  scale_x_reverse() +
  labs(
    title = "Final Rank vs Touch Frequency (Cleaned, Outliers Removed)",
    x = "Final Season Rank",
    y = "Total Touches (No Outlier Matches)"
  ) +
  theme_minimal()

TouchFreq_vs_FinalStandings_clean_Stats <- cor.test(
  Team_Touches_Standings_clean$TotalTouches,
  Team_Touches_Standings_clean$Rank
)

original_plot <- ggplot(Team_Touches_Standings, aes(x = Rank, y = TotalTouches)) +
  geom_point(size = 3, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue", linewidth = 1) +
  scale_x_reverse() +
  labs(
    title = "Original: Rank vs Touch Frequency (All Matches)",
    x = "Final Season Rank",
    y = "Total Touches"
  ) +
  theme_minimal()



original_plot + TouchFreq_vs_FinalStandings_clean

############################ Deep Dive Outlier Analysis ############################ 

Matches_foroutliers <- Matches_final %>%
  mutate(TeamID = substr(as.character(MatchID), 1, 2))

Touches_outliers_removed <- Touches_outliers_removed %>%
  mutate(OutlierStatus = "Outlier")

Touches_filteredoutliers <- Touches_filteredoutliers %>%
  mutate(OutlierStatus = "Normal")

# Combine both into one data frame
TouchMatch_Comparison <- bind_rows(Touches_outliers_removed, Touches_filteredoutliers) %>%
  mutate(Team = str_pad(as.character(Team), width = 2, pad = "0")) %>%
  left_join(Matches_foroutliers, by = c("Team" = "TeamID", "SeasonMatchNumber"))

TouchMatch_Comparison <- TouchMatch_Comparison %>%
  mutate(Team = str_pad(as.character(Team), width = 2, pad = "0"))

TouchMatch_Comparison <- TouchMatch_Comparison %>%
  mutate(
    GoalsFor = case_when(
      GoalsFor %in% c("X", "XX") ~ "0",
      TRUE ~ GoalsFor
    ),
    GoalsFor = as.numeric(GoalsFor)
  )

# Compute sample size per group
n_labels <- TouchMatch_Comparison %>%
  group_by(OutlierStatus) %>%
  summarise(
    n = n(),
    y_pos = max(GoalsFor, na.rm = TRUE) + 0.5  # position just above max value
  )

#goals scored
ggplot(TouchMatch_Comparison, aes(x = OutlierStatus, y = GoalsFor)) +
  geom_boxplot(fill = "gray", outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.6, color = "darkred") +
  geom_text(data = n_labels, aes(x = OutlierStatus, y = y_pos, label = paste0("n = ", n)), vjust = 0) +
  labs(
    title = "Goals Scored in Outlier vs Normal Matches",
    x = "Match Type",
    y = "Goals Scored"
  ) +
  theme_minimal()

#proportion of wins
TouchMatch_Comparison %>%
  group_by(OutlierStatus, Outcome) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(OutlierStatus) %>%
  mutate(
    prop = n / sum(n),
    label_y = prop + 0.03  # slightly above the bar
  ) %>%
  ggplot(aes(x = OutlierStatus, y = prop, fill = Outcome)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = paste0("n = ", n), y = label_y), 
            position = position_dodge(width = 0.9), 
            vjust = 0, size = 3.5) +
  labs(
    title = "Match Outcomes in Outlier vs Normal Matches",
    x = "Match Type",
    y = "Proportion of Matches"
  ) +
  theme_minimal()

# New n_labels specifically for Outcome x OutlierStatus
n_labels_touch <- TouchMatch_Comparison %>%
  group_by(Outcome, OutlierStatus) %>%
  summarise(
    n = n(),
    y_pos = max(TouchCount, na.rm = TRUE) + 2,
    .groups = "drop"
  )

#TouchCount vs Match Outcome
ggplot(TouchMatch_Comparison, aes(x = Outcome, y = TouchCount, fill = OutlierStatus)) +
  geom_boxplot(position = position_dodge(width = 0.75)) +
  geom_text(data = n_labels_touch,
            aes(x = Outcome, y = y_pos, label = paste0("n = ", n), group = OutlierStatus),
            position = position_dodge(width = 0.75),
            vjust = 0, size = 3.5) +
  labs(
    title = "Touch Count by Match Result and Outlier Status",
    x = "Match Result",
    y = "Touches",
    fill = "Match Type"
  ) +
  theme_minimal()

#do outlier matches have significantly higher goals? 
t.test(GoalsFor ~ OutlierStatus, data = TouchMatch_Comparison)
## 
##  Welch Two Sample t-test
## 
## data:  GoalsFor by OutlierStatus
## t = -2.1975, df = 21.71, p-value = 0.03896
## alternative hypothesis: true difference in means between group Normal and group Outlier is not equal to 0
## 95 percent confidence interval:
##  -1.07625361 -0.03072314
## sample estimates:
##  mean in group Normal mean in group Outlier 
##              1.296512              1.850000
#  Outlier matches (those with abnormally high/low touch rates) are associated with higher average goals scored than normal matches — by ~0.55 goals on average.

# This supports the idea that "touchy" or "chaotic" matches may coincide with more offensive action (more goals).
wilcox.test(GoalsFor ~ OutlierStatus, data = TouchMatch_Comparison)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  GoalsFor by OutlierStatus
## W = 2396.5, p-value = 0.0175
## alternative hypothesis: true location shift is not equal to 0
#There is a statistically significant difference in the distribution of goals scored between outlier and normal matches (p = 0.0175).
#This confirms the earlier t-test finding — but with fewer assumptions (no need for normality or equal variances).

#statistically more likely to have wins?
# Create a contingency table of outcomes by outlier status
table_outcomes <- TouchMatch_Comparison %>%
  count(OutlierStatus, Outcome) %>%
  pivot_wider(names_from = Outcome, values_from = n, values_fill = 0) %>%
  column_to_rownames("OutlierStatus") %>%
  as.matrix()

# Chi-squared test (good for larger samples)
chisq.test(table_outcomes)
## 
##  Pearson's Chi-squared test
## 
## data:  table_outcomes
## X-squared = 10.269, df = 2, p-value = 0.005891
# Optional: Fisher's Exact Test (more accurate with small samples)
fisher.test(table_outcomes)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table_outcomes
## p-value = 0.008557
## alternative hypothesis: two.sided
#ya it is stats sig

# Create a contingency table of outcomes by outlier status
table_outcomes <- table(TouchMatch_Comparison$OutlierStatus, TouchMatch_Comparison$Outcome)

# View the raw counts
print(table_outcomes)
##          
##             D   L   W
##   Normal   64 146 134
##   Outlier   1   4  15
# View the proportions per match type
prop.table(table_outcomes, margin = 1)  # margin = 1 → row-wise proportions (within each match type)
##          
##                   D         L         W
##   Normal  0.1860465 0.4244186 0.3895349
##   Outlier 0.0500000 0.2000000 0.7500000
chisq.test(table_outcomes)     # For general large-sample significance
## 
##  Pearson's Chi-squared test
## 
## data:  table_outcomes
## X-squared = 10.269, df = 2, p-value = 0.005891
fisher.test(table_outcomes)    # Better for small sample sizes
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table_outcomes
## p-value = 0.008557
## alternative hypothesis: two.sided
#Outlier matches — defined as games with unusually high (MAD > 2) prosocial touch counts — are strongly associated with higher win rates. This supports the idea that elevated team touch behavior may correlate with better team performance.


############################ Outlier Summary: Which Teams, How Many, Avg Touches ############################ 

# Step 1: Compute per-team summary for Normal and Outlier matches
Outlier_Summary <- bind_rows(Touches_outliers_removed, Touches_filteredoutliers) %>%
  mutate(
    Team = str_pad(as.character(Team), width = 2, pad = "0"),
    OutlierStatus = factor(OutlierStatus, levels = c("Normal", "Outlier"))
  ) %>%
  group_by(Team, OutlierStatus) %>%
  summarise(
    n_matches = n(),
    avg_touches = mean(TouchCount, na.rm = TRUE),
    min_touches = min(TouchCount, na.rm = TRUE),
    max_touches = max(TouchCount, na.rm = TRUE),
    .groups = "drop"
  )

# Step 2 (Optional): Wide format for side-by-side comparison
Outlier_Summary_wide <- Outlier_Summary %>%
  pivot_wider(
    names_from = OutlierStatus,
    values_from = c(n_matches, avg_touches, min_touches, max_touches),
    names_glue = "{.value}_{OutlierStatus}"
  )

# Step 3: Plot number of outlier vs normal matches per team
ggplot(Outlier_Summary, aes(x = Team, y = n_matches, fill = OutlierStatus)) +
  geom_col(position = "dodge") +
  labs(
    title = "Number of Outlier vs Normal Matches per Team",
    x = "Team",
    y = "Number of Matches",
    fill = "Match Type"
  ) +
  theme_minimal()

# Step 4 (Optional): View summary table
View(Outlier_Summary_wide)

library(ggplot2)

# Filter just the outlier matches
Touches_outliers_removed %>%
  ggplot(aes(x = Team, y = TouchCount, fill = ScaledTouch > 0)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.5) +
  geom_jitter(width = 0.2, alpha = 0.7, aes(color = ScaledTouch)) +
  geom_hline(yintercept = median(Touches_outliers_removed$TouchCount), linetype = "dashed", color = "black") +
  scale_color_gradient2(low = "blue", mid = "gray", high = "red", midpoint = 0, name = "Scaled Touch") +
  labs(
    title = "Touch Frequencies in Outlier Matches",
    subtitle = "Color shows direction of deviation: blue = low, red = high",
    x = "Team",
    y = "Touches in Outlier Match"
  ) +
  theme_minimal() +
  theme(legend.position = "right")

# Add a flag for whether a match is an outlier
Touches_scaled <- Touches_scaled %>%
  mutate(OutlierFlag = ifelse(abs(ScaledTouch) > 2, "Outlier", "Normal"))

# Plot all matches with MAD-scaled values, color by outlier status
ggplot(Touches_scaled, aes(x = Team, y = ScaledTouch, color = OutlierFlag)) +
  geom_jitter(width = 0.2, alpha = 0.7, size = 2) +
  scale_color_manual(values = c("Normal" = "gray", "Outlier" = "red")) +
  geom_hline(yintercept = c(-2, 2), linetype = "dashed", color = "black") +
  labs(
    title = "MAD-Scaled Touch Frequency per Match by Team",
    subtitle = "Outliers (|ScaledTouch| > 2) shown in red",
    x = "Team",
    y = "Scaled Touch (MAD Units)",
    color = "Match Type"
  ) +
  theme_minimal()

##################Parametric analysis

shapiro.test(Touches_scaled$TouchCount)         # Raw touches per match
## 
##  Shapiro-Wilk normality test
## 
## data:  Touches_scaled$TouchCount
## W = 0.93604, p-value = 2.121e-11
shapiro.test(Team_Touches_Standings$TotalTouches)  # Total touches per season
## 
##  Shapiro-Wilk normality test
## 
## data:  Team_Touches_Standings$TotalTouches
## W = 0.94314, p-value = 0.46
shapiro.test(Touches_scaled$ScaledTouch)
## 
##  Shapiro-Wilk normality test
## 
## data:  Touches_scaled$ScaledTouch
## W = 0.9318, p-value = 7.52e-12
shapiro.test(Team_touch_variability$MeanTouches)
## 
##  Shapiro-Wilk normality test
## 
## data:  Team_touch_variability$MeanTouches
## W = 0.94314, p-value = 0.46
shapiro.test(TouchMatch_Comparison$GoalsFor)
## 
##  Shapiro-Wilk normality test
## 
## data:  TouchMatch_Comparison$GoalsFor
## W = 0.86541, p-value < 2.2e-16
shapiro.test(TouchMatch_Comparison$TouchCount)
## 
##  Shapiro-Wilk normality test
## 
## data:  TouchMatch_Comparison$TouchCount
## W = 0.93604, p-value = 2.121e-11
shapiro.test(Team_Touches_Standings$TotalTouches)
## 
##  Shapiro-Wilk normality test
## 
## data:  Team_Touches_Standings$TotalTouches
## W = 0.94314, p-value = 0.46
# Histogram
ggplot(Touches_scaled, aes(x = TouchCount)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Touch Count", x = "Touch Count", y = "Frequency") +
  theme_minimal()

# Q-Q Plot
qqnorm(Touches_scaled$TouchCount)
qqline(Touches_scaled$TouchCount, col = "red")

mad_val <- mad(Team_Touches_Standings$TotalTouches)
sd_val <- sd(Team_Touches_Standings$TotalTouches)
mad_to_sd_ratio <- mad_val / sd_val

library(ggplot2)

ggplot(Team_Touches_Standings, aes(x = TotalTouches)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  geom_vline(aes(xintercept = median(TotalTouches)), color = "blue", linetype = "dashed") +
  geom_vline(aes(xintercept = median(TotalTouches) + mad(TotalTouches)), color = "red", linetype = "dotted") +
  geom_vline(aes(xintercept = median(TotalTouches) - mad(TotalTouches)), color = "red", linetype = "dotted") +
  labs(title = "Density Plot with MAD Bands", x = "Total Touches", y = "Density") +
  theme_minimal()

Results for Sub-Hypothesis 1

The above graph is showing a breakdown of within-team touch variation per game. Each block is one team, with the most successful team ranked in place 1 on the far right. Each team played 26 games throughout the season. Therefore, each block has 26 data points from which is being built.

The above graph is similar to the previous one, however it scales the touches.

MAD was used instead of SD due to the non-normal nature of the data.

The following formulas were used:

Mean Absolute Deviation (MAD) \[ \text{MAD} = \text{median} \left( \left| X_i - \text{median}(X) \right| \right) \] Scaled Touch \[ \text{ScaledTouch}_i = \frac{ X_i - \text{median}(X) }{ \text{MAD} } \]

Then I looked at the with-in team variation across teams. In other words, does consistency of touch frequencies correlate to end of season standings? Scaled touch version:

Team_scaled_variability_ranked_plot
## `geom_smooth()` using formula = 'y ~ x'

ScaledTouch_Variability_vs_Rank_Stats
## 
##  Pearson's product-moment correlation
## 
## data:  Team_scaled_variability_ranked$SD_ScaledTouch and Team_scaled_variability_ranked$Rank
## t = 1.5994, df = 12, p-value = 0.1357
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1432663  0.7769560
## sample estimates:
##       cor 
## 0.4191765

Not statistically significant. However, there is a moderate positive correlation: a higher deviation of scaled touch is somewhat associated with worse ranking. In other words, there is a slight possible relationship between more consistency of touch frequency and end of season ranking.

Hypothesis 2: Inter-Match Variability

The frequency of pro-social touch across teams may vary due to confounding variables such as team culture or tactics. Therefore, instead of analyzing team to team variations, we suggest that the inter-match variability of pro-social touches within an individual team could be used as an indicator for individual match outcomes. We propose that teams with more pro-social touches within a match than their respective season average are more likely to have a higher goal differential.

Statistically Significant Results. Yay.

This is where Anne would like me to blow out to 14 graphs per team.

Note: Whereas hypothesis 1 looked at outcomes for the whole season, hypothesis 2 is looking at a match level outcome.

For each team, we calculated how many touches they made in a match above or below their own season average (in raw touch counts). We then tested whether matches where a team touched more than usual were associated with better goal differential.

To further test whether match outcomes differed between matches with above- or below-average team touch behavior, we conducted a Wilcoxon rank-sum test (non-parametric, due to non-normal distribution of goal differentials). The test compared goal differential in matches where teams touched more than their season average versus those with below-average touch counts.

Scaling of touch took a simpler approach given that we are looking at individual matches. I did how ‘touchy’ this match was compared to a team’s season average:

\[ \text{ScaledAboveAvg} = \text{TouchCount}_{\text{match}} - \text{SeasonAvgTouch}_{\text{team}} \]

# Inter-Match Variability Hypothesis
# For more information on these data frames please look at the README.md file

library(tidyverse)
library(data.table)
library(broom)
library(janitor)
library(readxl)
library(rmarkdown)
library(readr)
library (dplyr)

#Ensure to use the correct dfs. Touches_final and Matches_final are correct. They only include assigned rater data, no repeat matches

#Check to make sure data frames are loaded:
if (!exists("Touches_final") | !exists("Matches_final") | !exists("FinalStandings") | !exists("Touches_CoreHyp")) {
  stop("Touches_final, Matches_final, Touhe_CoreHyp or FinalStandings not loaded. Check Data_Management.R and Core_Hypothesis.R.")
}

############## Inter-Match Variability Hypothesis ############## 

# Basically asking: "When a team is more (or less) touchy than usual, do they score more or less goals than their opponent?"
# Note that this is still using same CoreHyp touches (therefore only prosocial touches and not including GF, GA, Subs etc)

#Clean Match column data for use
Matches_final_cleaned <- Matches_final %>%
  mutate(
    GoalsFor = as.numeric(str_trim(GoalsFor)),
    GoalsAgainst = as.numeric(str_trim(GoalsAgainst))
  )

#Get TeamID into Matches_final
Matches_finalID <- Matches_final_cleaned %>%
  mutate(
    MatchID = str_pad(MatchID, width = 4, pad = "0"),  # in case it was shortened
    TeamID = str_sub(MatchID, 1, 2),                     # preserve leading zeros
    GoalDiff = GoalsFor - GoalsAgainst
  )

# Prosocial touches per team per match
Touches_per_match <- Touches_CoreHyp %>%
  group_by(Team, SeasonMatchNumber) %>%
  summarise(TouchCount = n(), .groups = "drop")

# Average touch per team over the season
Team_season_avg <- Touches_per_match %>%
  group_by(Team) %>%
  summarise(SeasonAvgTouch = mean(TouchCount), .groups = "drop")

# Merge & calculate scaled deviation from average
Touches_scaled_dev <- Touches_per_match %>%
  left_join(Team_season_avg, by = "Team") %>%
  mutate(ScaledAboveAvg = TouchCount - SeasonAvgTouch) # positive = more touchy than average, neg = less touchy than average

# Joins dataframes (goal differetials to the touches scaled)
Touch_GoalDiff_Analysis <- Touches_scaled_dev %>%
  left_join(
    Matches_finalID %>% select(SeasonMatchNumber, TeamID, GoalDiff),
    by = c("SeasonMatchNumber", "Team" = "TeamID")
  )

# Visualize 
Touch_GoalDiff_Analysis_Graph <- ggplot(Touch_GoalDiff_Analysis, aes(x = ScaledAboveAvg, y = GoalDiff)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = "Touch Count Deviation vs Match Goal Differential",
    x = "Touches Above/Below Team Average",
    y = "Goal Differential",
    caption = "Note: Each dot represents one match outcome for a team, therefore 2 dots for each match"
  ) +
  theme_minimal()

# Pearson
Touch_GoalDiff_Analysis_Stats <- cor.test(Touch_GoalDiff_Analysis$ScaledAboveAvg, Touch_GoalDiff_Analysis$GoalDiff)

Results for Hypothesis 2:

## 
##  Pearson's product-moment correlation
## 
## data:  Touch_GoalDiff_Analysis$ScaledAboveAvg and Touch_GoalDiff_Analysis$GoalDiff
## t = 5.6914, df = 312, p-value = 2.906e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2028881 0.4036665
## sample estimates:
##      cor 
## 0.306685

On the graph above: Positive values indicate matches where the team touched more than usual; negative values indicate matches with lower than usual touch counts.

We observed a statistically significant positive correlation between match-level touch deviation and goal differential (r = 0.31, p < 0.001, 95% CI [0.20, 0.40]). In other words, matches where a team touched more than their season average were associated with higher goal differentials.

Hypothesis 3: Underdog Hypothesis

Teams with greater frequencies of pro-social touch when competing against higher ranked teams will secure a better goal differential than teams with fewer touch instances facing an opponent of the same spread in rankings.

Not Statistically Significant.

Definition of spread: Spread in rankings is defined as the difference in current standings the teams have between each other. Therefore, the 1st ranked team competing against the 7th ranked team has the same spread as the 7th ranked team playing against the 14th ranked team. This could expose a team’s cohesion by how they react when facing adversity.

Higher positive spread = strong team vs weak team Negative spread = underdog team

In other words → touch might help underdog teams cope with adversity.

For each match, computed how touchy the team was vs. its typical behavior: ScaledTouch = (TouchCount - Median) / MAD

I used both an ANOVA (which described nothing of interest) and a GAM model.

Note: First 13 matches of season are excluded. Not all teams play the first weekend of the season. Therefore, the current standings only have a ‘comprehensive picture’ going into the 3rd weekend. In other words, after match 13, at least every team has completed 1 game.

Additionally, for the categorical analysis I arbitrarily determined that a spread between 0-7 was a “mild underdog” or a “mild favorite. A spread greater than 7 was considered”major”. This is simply because there are 14 teams in the league and a spread of 7 would place a team on teh other half of the table.

SpreadGroup = Major Underdog, Mild Underdog, Even, Mild Favorite, Major Favorite

TouchGroup = High Touch (≥1 MAD), Low Touch (≤-1 MAD), Average Touch

# Underdog Hypothesis

library(tidyverse)
library(data.table)
library(broom)
library(janitor)
library(readxl)
library(rmarkdown)
library(readr)
library (dplyr)
library(plotly)
library(mgcv)
library(ggplot2)
library(forcats)

#Ensure to use the correct dfs. Touches_final and Matches_final are correct. They only include assigned rater data, no repeat matches

#Check to make sure data frames are loaded:

if (!exists("Touches_final") | !exists("Touches_scaled") | !exists("Matches_finalID") | !exists("FinalStandings") | !exists("Touches_CoreHyp")) {
  stop("Touches_final, Matches_final, Touhe_CoreHyp or FinalStandings not loaded. Check Data_Management.R and Core_Hypothesis.R.")
}

############################ Underdog Hypothesis ############################

# Clean Match column data for use
Matches_final_cleaned_CurrentStandings <- Matches_final %>%
  mutate(
    GoalsFor = as.numeric(str_trim(GoalsFor)),
    GoalsAgainst = as.numeric(str_trim(GoalsAgainst)),
    CurrentStanding = as.numeric(str_trim(CurrentStanding)),
    MatchID = str_pad(MatchID, width = 4, pad = "0"),  # ensure 4-digit MatchID
    TeamID = str_sub(MatchID, 1, 2),                   # extract TeamID from MatchID
    SeasonMatchNumber = as.numeric(SeasonMatchNumber)  # ensure it can be compared numerically
  ) %>%
  filter(SeasonMatchNumber > 13)  # exclude first 13 matches

# Get the spread into the info for each team
Matches_final_Spread <- Matches_final_cleaned_CurrentStandings %>%
  rename_with(~ paste0(.x, "_self")) %>% #renames every column so that you know which row refers to the self team of analysis
  inner_join(
    Matches_final_cleaned_CurrentStandings,
    by = c("SeasonMatchNumber_self" = "SeasonMatchNumber") # joins data frame to itself, matching each game via seasonmatchnumber (_self is of interst) (wihtout is opponent)
  ) %>%
  filter(TeamID_self != TeamID) %>%  # Make sure we’re not joining a row to itself
  mutate(
    GoalDiff = GoalsFor_self - GoalsAgainst_self,
    Spread = CurrentStanding - CurrentStanding_self  # positive = better ranked than opponent, negative = underdog
  ) %>%
  select(
    SeasonMatchNumber = SeasonMatchNumber_self,
    MatchID = MatchID_self,
    TeamID = TeamID_self,
    GoalDiff,
    Spread,
    CurrentStanding = CurrentStanding_self,
    OpponentTeamID = TeamID,
    OpponentStanding = CurrentStanding
  )

#Correcting data types
Touches_scaled_numeric <- Touches_scaled %>%
  mutate(
    SeasonMatchNumber = as.numeric(SeasonMatchNumber),
    Team = as.character(Team)  # just to ensure consistency
  )

# Data frame creation for Underdog analysis
Underdog_Analysis <- Matches_final_Spread %>%
  left_join(
    Touches_scaled_numeric,
    by = c("SeasonMatchNumber", "TeamID" = "Team")
  )

############################ Observed Data Table Summary | Underdog Hypothesis ############################

spread_cutoff <- 7 #arbitrary spread number: I like 7 because it separates the table in half (1st rank team playing against bottom half of table)

# Categorize real match data into underdog/favored + touch level
Underdog_Observed_Summary <- Underdog_Analysis %>%
  filter(!is.na(GoalDiff) & !is.na(ScaledTouch) & !is.na(Spread)) %>%  # ensure clean data
  mutate(
    SpreadGroup = case_when(
      Spread <= -spread_cutoff ~ "Major Underdog",
      Spread > -spread_cutoff & Spread < 0 ~ "Mild Underdog",
      Spread == 0 ~ "Even",
      Spread > 0 & Spread < spread_cutoff ~ "Mild Favorite",
      Spread >= spread_cutoff ~ "Major Favorite"
    ),
    TouchGroup = case_when(
      ScaledTouch >= 1 ~ "High Touch",
      ScaledTouch <= -1 ~ "Low Touch",
      TRUE ~ "Average Touch"
    )
  ) %>%
  group_by(SpreadGroup, TouchGroup) %>%
  summarise(
    MeanObservedGD = mean(GoalDiff, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) %>%
  arrange(SpreadGroup, TouchGroup)

#Bar chart for this data: I think easier to understand than the cool looking 3D chart generated below.
# Set factor levels for order
spread_levels <- c("Major Underdog", "Mild Underdog", "Even", "Mild Favorite", "Major Favorite")
touch_levels <- c("Low Touch", "Average Touch", "High Touch")

# Make sure SpreadGroup and TouchGroup are ordered
Underdog_Observed_Summary <- Underdog_Observed_Summary %>%
  mutate(
    SpreadGroup = factor(SpreadGroup, levels = spread_levels),
    TouchGroup = factor(TouchGroup, levels = touch_levels)
  )

# Plot observed data
Barchart_Categorical_Data <- ggplot(Underdog_Observed_Summary, aes(x = SpreadGroup, y = MeanObservedGD, fill = TouchGroup)) +
  geom_col(position = position_dodge(width = 0.8), width = 0.7) +
  geom_text(
    aes(label = paste0("n=", n)),
    position = position_dodge(width = 0.8),
    vjust = ifelse(Underdog_Observed_Summary$MeanObservedGD >= 0, -0.5, 1.2),
    size = 3.5
  ) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Observed Goal Differential by Underdog/Favorite Status and Touch Level",
    x = "Underdog/Favorite Status (Spread Group)",
    y = "Mean Goal Differential",
    fill = "Touch Level"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

# Statistically compare goal differentials across spreadgroup and touchgroup
# Via a two-way ANOVa 
# Make sure grouping variables are factors
Underdog_Observed_ANOVA <- Underdog_Analysis %>%
  filter(!is.na(GoalDiff) & !is.na(ScaledTouch) & !is.na(Spread)) %>%
  mutate(
    SpreadGroup = case_when(
      Spread <= -spread_cutoff ~ "Major Underdog",
      Spread > -spread_cutoff & Spread < 0 ~ "Mild Underdog",
      Spread == 0 ~ "Even",
      Spread > 0 & Spread < spread_cutoff ~ "Mild Favorite",
      Spread >= spread_cutoff ~ "Major Favorite"
    ),
    TouchGroup = case_when(
      ScaledTouch >= 1 ~ "High Touch",
      ScaledTouch <= -1 ~ "Low Touch",
      TRUE ~ "Average Touch"
    ),
    SpreadGroup = factor(SpreadGroup, levels = spread_levels),
    TouchGroup = factor(TouchGroup, levels = touch_levels)
  )

# Run Two-Way ANOVA
anova_result <- aov(GoalDiff ~ SpreadGroup * TouchGroup, data = Underdog_Observed_ANOVA)
ANOVASUM <- summary(anova_result)
TukeryAnova <- TukeyHSD(anova_result)

#Summary Table of Means and SDs per Group
Underdog_Observed_ANOVA %>%
  group_by(SpreadGroup, TouchGroup) %>%
  summarise(
    Mean_GD = mean(GoalDiff, na.rm = TRUE),
    SD_GD = sd(GoalDiff, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) %>%
  arrange(SpreadGroup, TouchGroup)
## # A tibble: 12 × 5
##    SpreadGroup    TouchGroup    Mean_GD SD_GD     n
##    <fct>          <fct>           <dbl> <dbl> <int>
##  1 Major Underdog Low Touch      -1.88   1.89     8
##  2 Major Underdog Average Touch  -1.08   1.41    25
##  3 Major Underdog High Touch     -0.6    1.67     5
##  4 Mild Underdog  Low Touch      -1.35   1.06    17
##  5 Mild Underdog  Average Touch  -0.365  1.52    63
##  6 Mild Underdog  High Touch      0.481  1.53    27
##  7 Mild Favorite  Low Touch      -0.714  1.70     7
##  8 Mild Favorite  Average Touch   0.266  1.51    79
##  9 Mild Favorite  High Touch      0.810  1.57    21
## 10 Major Favorite Low Touch       1.25   1.71     4
## 11 Major Favorite Average Touch   1.2    1.61    30
## 12 Major Favorite High Touch      1      1.41     4
#ANOVA categorically dos not say that spread AND touch together affects goal differential
#It does say that underdogs are more likely to lose (duh) and that more touch is better (duh)
#Look at GAM Model to see other stuff, there it creates a a relationship between the three values

############################ GAM Model | Underdog Hypothesis ############################

#This is asking: If a team is more/less touchy than usual, and they are underdog/overdog, how does this impact the goal differenetial?

# Fit GAM model to allow for nonlinear effects

# Changing k = 15 creates completely flattens slope
gam_model <- gam(
  GoalDiff ~ s(Spread, ScaledTouch, k = 100, bs = "tp"),  # increase k for smoother fit
  data = Underdog_Analysis
)

# Create grid for predictions
spread_seq <- seq(min(Underdog_Analysis$Spread, na.rm = TRUE),
                  max(Underdog_Analysis$Spread, na.rm = TRUE), length.out = 50)
touch_seq <- seq(min(Underdog_Analysis$ScaledTouch, na.rm = TRUE),
                 max(Underdog_Analysis$ScaledTouch, na.rm = TRUE), length.out = 50)

grid <- expand.grid(Spread = spread_seq, ScaledTouch = touch_seq)
grid$GoalDiff <- predict(gam_model, newdata = grid)

# Convert to matrix for surface
z_matrix <- matrix(grid$GoalDiff, nrow = length(spread_seq), ncol = length(touch_seq))

# 3D Plot: See below in next R chunk
#Note: Each dot represents one match outcome for a team, therefore 2 dots for each match
# plot_ly() %>%
#   add_surface(
#     x = ~spread_seq,
#     y = ~touch_seq,
#     z = ~z_matrix,
#     colorscale = list(
#       c(0, "red"),  # red for losses
#       c(1, "green")   # green for wins
#     ),
#     cmin = min(Underdog_Analysis$GoalDiff, na.rm = TRUE),
#     cmax = max(Underdog_Analysis$GoalDiff, na.rm = TRUE),
#     opacity = 0.7,
#     showscale = TRUE
#   ) %>%
#   add_markers(
#     data = Underdog_Analysis,
#     x = ~Spread,
#     y = ~ScaledTouch,
#     z = ~GoalDiff,
#     marker = list(
#       size = 3,
#       color = ~GoalDiff,
#       colorscale = list(c(0, "#ff0000"), c(1, "#00ff00")),  # flipped: red = low, green = high
#       cmin = min(Underdog_Analysis$GoalDiff, na.rm = TRUE),
#       cmax = max(Underdog_Analysis$GoalDiff, na.rm = TRUE)
#     ),
#     name = "Observed"
#   ) %>%
#   layout(
#     title = "Underdog Hypothesis: Spread x Touch Deviation x Goal Differential",
#     scene = list(
#       xaxis = list(title = "Spread (Opponent Rank - Team Rank)"),
#       yaxis = list(title = "Scaled Touch Deviation"),
#       zaxis = list(title = "Goal Differential")
#     )
#   )


############################ GAM Model CATEGORICAL Table Summary (Interpretation of GAM) | Underdog Hypothesis ############################

# Evaluates how a team's goal differential is predicated by a model across a spectrum of two predictors:
# Predictor One = Ranking Spread
# Predictor Two = Scaled Touch Deviation (how much more or less physical touch a team used compared to their norm)

# Backpedals to the observed table CATEGORICAL idea. 

# Define a grid of Spread and ScaledTouch values
# Create a sequence of 100 evenly spaced values from smallest to largest spread
spread_vals <- seq(min(Underdog_Analysis$Spread, na.rm = TRUE),
                   max(Underdog_Analysis$Spread, na.rm = TRUE),
                   length.out = 100)

# Create a sequence of 100 evenly spaced values from smallest to largest touch
touch_vals <- seq(min(Underdog_Analysis$ScaledTouch, na.rm = TRUE),
                  max(Underdog_Analysis$ScaledTouch, na.rm = TRUE),
                  length.out = 100)

# Create a ten thousand row data frame by combining all hundred by hundred values above
grid <- expand.grid(Spread = spread_vals, ScaledTouch = touch_vals)

# Predict GoalDiff across the grid on each fake game (ten thousand of them)
grid$PredictedGoalDiff <- predict(gam_model, newdata = grid)

#Note choice of 7 as the cutoff is somewhat arbitrary
grid_summary <- grid %>%
  mutate(
    SpreadGroup = case_when(
      Spread <= -spread_cutoff ~ "Major Underdog",
      Spread > -spread_cutoff & Spread < 0 ~ "Mild Underdog",
      Spread == 0 ~ "Even",
      Spread > 0 & Spread < spread_cutoff ~ "Mild Favorite",
      Spread >= spread_cutoff ~ "Major Favorite"
    ),
    TouchGroup = case_when(
      ScaledTouch >= 1 ~ "High Touch",
      ScaledTouch <= -1 ~ "Low Touch",
      TRUE ~ "Average Touch"
    )
  ) %>%
  group_by(SpreadGroup, TouchGroup) %>%
  summarise(
    MeanPredGD = mean(PredictedGoalDiff, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(SpreadGroup, TouchGroup)

#Bar graph to see the GAM model major/mild underdogs versus tables
# Set same factor levels for consistency
grid_summary <- grid_summary %>%
  mutate(
    SpreadGroup = factor(SpreadGroup, levels = spread_levels),
    TouchGroup = factor(TouchGroup, levels = touch_levels)
  )

# Plot GAM model data
GAM_Model_Data_Plot <- ggplot(grid_summary, aes(x = SpreadGroup, y = MeanPredGD, fill = TouchGroup)) +
  geom_col(position = position_dodge(width = 0.8), width = 0.7) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "GAM Model Predicted Goal Differential by Underdog/Favorite Status and Touch Level",
    x = "Underdog/Favorite Status (Spread Group)",
    y = "Predicted Goal Differential",
    fill = "Touch Level"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

#Output for Hypothesis 3:

Observed Mean Goal Differential by Spread and Touch Level
SpreadGroup TouchGroup MeanObservedGD n
Major Favorite Average Touch 1.20 30
Major Favorite High Touch 1.00 4
Major Favorite Low Touch 1.25 4
Major Underdog Average Touch -1.08 25
Major Underdog High Touch -0.60 5
Major Underdog Low Touch -1.88 8
Mild Favorite Average Touch 0.27 79
Mild Favorite High Touch 0.81 21
Mild Favorite Low Touch -0.71 7
Mild Underdog Average Touch -0.37 63
Mild Underdog High Touch 0.48 27
Mild Underdog Low Touch -1.35 17

Note: This graph is showing true, observed data. It is not from a model.

For the above graph, a categorical approach was used:

SpreadGroup = Major Underdog, Mild Underdog, Even, Mild Favorite, Major Favorite

TouchGroup = High Touch (≥1 MAD), Low Touch (≤-1 MAD), Average Touch

In the observed match data, goal differentials were associated with both touch level and underdog/favorite status. Among major underdogs, higher-touch matches were linked to more favorable outcomes (Mean GD = -0.60 for high-touch vs. -1.88 for low-touch matches). Mild underdogs also showed a similar pattern, with high-touch matches associated with positive mean goal differentials (+0.48), compared to negative outcomes for low-touch matches (-1.35). Mild favorites likewise showed more positive outcomes with higher touch. In contrast, among major favorites, touch level appeared to have less association with outcome, as these teams generally won regardless of touch level. These associations suggest that prosocial touch behaviors may be more strongly linked to match outcomes when teams face greater competitive challenges (as underdogs).

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## GoalDiff ~ s(Spread, ScaledTouch, k = 100, bs = "tp")
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.057e-11  8.691e-02       0        1
## 
## Approximate significance of smooth terms:
##                       edf Ref.df     F p-value    
## s(Spread,ScaledTouch)   2      2 43.67  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.228   Deviance explained = 23.3%
## GCV = 2.2134  Scale est. = 2.1905    n = 290

The Generalized Additive Model (GAM) used the following form:

\[ \text{GoalDiff} \sim s(\text{Spread}, \text{ScaledTouch}) \]

where \(s()\) is a smooth spline function allowing for nonlinear effects of Spread and ScaledTouch on GoalDiff.

The GAM model revealed a significant nonlinear association between Spread, ScaledTouch, and Goal Differential (smooth term: edf = 2.0, F = 43.67, p < 0.001). The model explained approximately 23% of the variance in goal differential (adjusted R² = 0.23), supporting the hypothesis that pro-social touch behaviors are linked to match outcomes, particularly in relation to underdog status.

The data suggest that when teams are underdogs, using more prosocial touch is linked to better match outcomes — hinting that these behaviors might help teams maintain cohesion and resilience against stronger opponents.

I then created another bar chart from the model instead of observed values. So this bar chart shows the model’s predicted average outcome (GoalDiff) for each category.

grid_summary <- grid %>%
  mutate(
    SpreadGroup = case_when(
      Spread <= -spread_cutoff ~ "Major Underdog",
      Spread > -spread_cutoff & Spread < 0 ~ "Mild Underdog",
      Spread == 0 ~ "Even",
      Spread > 0 & Spread < spread_cutoff ~ "Mild Favorite",
      Spread >= spread_cutoff ~ "Major Favorite"
    ),
    TouchGroup = case_when(
      ScaledTouch >= 1 ~ "High Touch",
      ScaledTouch <= -1 ~ "Low Touch",
      TRUE ~ "Average Touch"
    )
  ) %>%
  group_by(SpreadGroup, TouchGroup) %>%
  summarise(
    MeanPredGD = mean(PredictedGoalDiff, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(SpreadGroup, TouchGroup)

GAM_Model_Data_Plot

knitr::kable(grid_summary, digits = 2, caption = "GAM Model Predicted Goal Differential by Spread and Touch Level")
GAM Model Predicted Goal Differential by Spread and Touch Level
SpreadGroup TouchGroup MeanPredGD
Major Favorite Average Touch 1.12
Major Favorite High Touch 2.73
Major Favorite Low Touch 0.46
Major Underdog Average Touch -1.22
Major Underdog High Touch 0.39
Major Underdog Low Touch -1.88
Mild Favorite Average Touch 0.36
Mild Favorite High Touch 1.97
Mild Favorite Low Touch -0.30
Mild Underdog Average Touch -0.46
Mild Underdog High Touch 1.15
Mild Underdog Low Touch -1.12

Nothing here. Touch helps all groups. It is additive, not synergistic.

Hypothesis 4: Social Network and Reciprocity

Hyp 4a had a statistically significant result. Hyp 4b did not.

Hypothesis 4.a: We propose that a higher ratio of reciprocal to non-reciprocal touches is indicative of better team cohesion and therefore correlates to a higher finish in the final season standings.

\[ \text{Reciprocal\ Ratio} = \frac{ \text{Number of Reciprocal Touches} }{ \text{Number of Non-Reciprocal Touches} } \]

Note: In addition to the situations mentioned previously, Injury Touches were excluded from the reciprocal ratio analysis as they are almost solely non-reciprocal. The injured party rarely touches back as they are in pain on the ground.

Hypothesis 4.b: We propose that teams with a more even distribution of prosocial touch events across players, both giving and receiving, will finish higher in the final season standings.

# Social Network Strength Hypothesis
# For more information on these data frames please look at the README.md file

library(tidyverse)
library(data.table)
library(broom)
library(janitor)
library(readxl)
library(rmarkdown)
library(readr)
library (dplyr)
library(plotly)
library(mgcv)
library(ggplot2)
library(forcats)
library(ggridges)
library(DescTools)


#Ensure to use the correct dfs. Touches_final and Matches_final are correct. They only include assigned rater data, no repeat matches

#Check to make sure data frames are loaded:
if (!exists("Touches_final") | !exists("Touches_scaled") | !exists("Matches_finalID") | !exists("FinalStandings") | !exists("Touches_CoreHyp")) {
  stop("Touches_final, Matches_final, Touhe_CoreHyp or FinalStandings not loaded. Check Data_Management.R and Core_Hypothesis.R.")
}

############################ Social Network Strength ############################

#Clean data to grab what we need: reciprocity, toucher numbers, touchee numbers
#Group event also count towards reciprocal, so long as it is not GF, SUB etc 

Touches_Reciprocal <- Touches_final %>% 
  mutate(
    Team = as.character(Team),
    Reciprocity = str_trim(Reciprocal) #cleans up white spaces in case
  )

FinalStandings <- FinalStandings %>%
  mutate(TeamID = as.character(TeamID)) %>%  # <--- force character
  mutate(TeamID = str_pad(TeamID, width = 2, pad = "0"))

############################ Reciprocal/Non-Reciprocal | Social Network Strength ############################

#Reciprocal vs non-reciprocal touch | simple final standings to ratio of reciprocal/non-reciprocal

# Count touches by team and reciprocity type
Reciprocity_by_Team <- Touches_Reciprocal %>%
  filter(!is.na(Reciprocity)) %>%
  mutate(
    ReciprocityType = case_when(
      Reciprocity %in% c("Y", "G") ~ "Reciprocal", #Reciprocal includes group events
      Reciprocity == "N" ~ "NonReciprocal",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(!is.na(ReciprocityType)) %>%
  group_by(Team, ReciprocityType) %>%
  summarise(TouchCount = n(), .groups = "drop") %>%
  pivot_wider(names_from = ReciprocityType, values_from = TouchCount, values_fill = 0) %>%
  mutate(
    Reciprocal_To_NonRecip_Ratio = Reciprocal / NonReciprocal,
    DataFlag = if_else(Reciprocal + NonReciprocal == 0, "No Touch Data", "Data Available")
  )

#Join with Final Standings
Reciprocal_vs_Rank <- FinalStandings %>%
  mutate(TeamID = str_pad(as.character(TeamID), width = 2, pad = "0")) %>%
  left_join(Reciprocity_by_Team, by = c("TeamID" = "Team"))

#Plot
Reciprocal_Ratio_Graph <- ggplot(Reciprocal_vs_Rank, aes(x = Rank, y = Reciprocal_To_NonRecip_Ratio)) +
  geom_point(size = 3, color = "steelblue") +  # all teams shown in one color
  geom_smooth(
    data = filter(Reciprocal_vs_Rank, !is.na(Reciprocal_To_NonRecip_Ratio)),
    aes(x = Rank, y = Reciprocal_To_NonRecip_Ratio),
    method = "lm", se = FALSE, color = "black"
  ) +
  scale_x_reverse(breaks = 1:max(Reciprocal_vs_Rank$Rank)) +
  labs(
    title = "Reciprocal Touch Ratio vs Final Season Rank per Team",
    subtitle = "Higher ratios may reflect stronger intra-team cohesion",
    caption = "Note: Each dot represents one team",
    x = "Team Final Season Rank",
    y = "Team Touch Ratio (Reciprocal : Non-Reciprocal)",
  ) +
  theme_minimal()

#Stats Sum
reciprocal_lm <- lm(Rank ~ Reciprocal_To_NonRecip_Ratio, data = Reciprocal_vs_Rank)

summary_reciprocal_lm <- summary(reciprocal_lm)

#Creates Dataframe that Flags values that don't match with requirements of strings, jersey numbers, G, SU, ??
#Cleans common typos (which were many)
Touches_players_flagged <- Touches_final %>%
  mutate(
    ToucherNumber = as.character(ToucherNumber),
    ToucheeNumber = as.character(ToucheeNumber),
    PlayersInvolved = as.character(PlayersInvolved),
    PlayersInvolved = PlayersInvolved %>%
      str_replace_all("\\.\\s+", ",") %>%           # fix "10. 12" → "10,12"
      str_remove_all("[\"'`:;.`]") %>%              # Remove unwanted punctuation
      str_replace_all("\\s*,\\s*", ",") %>%         # Normalize commas and spacing
      str_replace_all("\\s+", "") %>%               # Remove stray spaces
      str_replace_all("(?<=\\d{2})(?=\\d{2}$)", ",") %>%       # Insert comma in "1210" → "12,10"
      str_replace(",+$", "") %>%                    # Remove trailing commas
      str_trim(),                                   # Clean up leading/trailing space 
    
    
    # Valid if it's a number, "G", "SU", or "??"
    ToucherNumber_Valid = str_detect(ToucherNumber, "^\\d{1,2}$|^G$|^SU$|^\\?\\?$"),
    ToucheeNumber_Valid = str_detect(ToucheeNumber, "^\\d{1,2}$|^G$|^SU$|^\\?\\?$"),
    
    # Valid PlayersInvolved: at least two elements (numbers or ?? or SU), comma-separated
    PlayersInvolved_List = str_split(PlayersInvolved, ",\\s*"),
    PlayersInvolved_Valid = map_lgl(PlayersInvolved_List, function(players) {
      cleaned <- str_trim(players)
      all_valid <- all(str_detect(cleaned, "^\\d+$|^\\?\\?$|^SU$"))
      has_multiple <- length(cleaned) >= 2
      all_valid && has_multiple
    })
  )

#Splits "12,04,09" back into "12, 04, 09"
Touches_players_flagged <- Touches_players_flagged %>%
  mutate(PlayersInvolved = str_split(PlayersInvolved, ","))

#Ignore those flagged values for now (fix them before final paper)
Touches_players_final <- Touches_players_flagged %>%
  filter(
    ToucherNumber_Valid,
    ToucheeNumber_Valid,
    PlayersInvolved_Valid
  )

############################ Player Concentration | Social Network Strength ############################

# Count number of times each player was the Toucher within each team
toucher_counts <- Touches_players_final %>%
  filter(!is.na(ToucherNumber) & ToucherNumber != "G") %>%
  group_by(Team, ToucherNumber) %>%
  summarise(TouchCount = n(), .groups = "drop")

Toucher_Plot <- ggplot(toucher_counts, aes(x = ToucherNumber, y = TouchCount)) +
  geom_col(fill = "steelblue") +
  facet_wrap(~ Team, scales = "free_x") +
  labs(
    title = "Toucher Frequency by Player Jersey Number",
    x = "Player (ToucherNumber)",
    y = "Number of Touches"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5),
    strip.text = element_text(face = "bold")
  )

#########Histograms for Touchee by team (creates 14 histograms for team by team analysis)

# Count number of times each player was the Touchee within each team
touchee_counts <- Touches_players_final %>%
  filter(!is.na(ToucheeNumber) & ToucheeNumber != "G") %>%
  group_by(Team, ToucheeNumber) %>%
  summarise(TouchCount = n(), .groups = "drop")

Touchee_Plot <- ggplot(touchee_counts, aes(x = ToucheeNumber, y = TouchCount)) +
  geom_col(fill = "steelblue") +
  facet_wrap(~ Team, scales = "free_x") +
  labs(
    title = "Touchee Frequency by Player Jersey Number",
    x = "Player (ToucheeNumber)",
    y = "Number of Touches"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5),
    strip.text = element_text(face = "bold")
  )

#########Histograms for players involved in any touch event, as toucher, touchee or group (creates 14 histograms for team by team analysis)

# Properly split and unnest
player_counts <- Touches_players_final %>%
  select(Team, PlayersInvolved) %>%
  separate_rows(PlayersInvolved, sep = ",\\s*") %>%
  mutate(PlayersInvolved = str_trim(PlayersInvolved)) %>%
  filter(PlayersInvolved != "") %>%
  group_by(Team, PlayersInvolved) %>%
  summarise(TouchCount = n(), .groups = "drop") %>%
  arrange(Team, desc(TouchCount))

AnyTouchPlot <- ggplot(player_counts, aes(x = fct_reorder(PlayersInvolved, -TouchCount), y = TouchCount)) +
  geom_col(fill = "steelblue") +
  facet_wrap(~ Team, scales = "free_x") +
  labs(
    title = "Player Touch Involvement (PlayersInvolved Column)",
    subtitle = "Each bar shows one player's total involvement across the season",
    x = "Player Jersey / Code",
    y = "Touch Count"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5, size = 8),
    strip.text = element_text(face = "bold")
  )


#########Histograms for players involved in group events by team (creates 14 histograms for team by team analysis)

#Filters for just rows that contain the "G"
group_touches <- Touches_players_final %>%
  filter(ToucherNumber == "G")

player_counts_groupsanalysis <- group_touches %>%
  select(Team, PlayersInvolved, ToucherNumber) %>%
  separate_rows(PlayersInvolved, sep = ",\\s*") %>%
  mutate(PlayersInvolved = str_trim(PlayersInvolved)) %>%
  filter(PlayersInvolved != "") %>%
  group_by(Team, PlayersInvolved) %>%
  summarise(TouchCount = n(), .groups = "drop") %>%
  arrange(Team, desc(TouchCount))

#Plots
GroupTouch_Plot <- ggplot(player_counts_groupsanalysis, aes(x = PlayersInvolved, y = TouchCount)) +
  geom_col(fill = "steelblue") +
  facet_wrap(~ Team, scales = "free_x") +
  labs(
    title = "Player Involvement in Group Events (Touches_final)",
    subtitle = "Each bar shows a player's total involvement in group touches",
    x = "Player Jersey",
    y = "Group Touch Count"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

#########Ridge plot that shows touchiest players stacked on top of each other by team (one plot for all 14 teams)

# Build long-format player count data (from cleaned PlayersInvolved)
player_ridgeplot_counts <- Touches_players_final %>%
  select(Team, PlayersInvolved) %>%
  separate_rows(PlayersInvolved, sep = ",\\s*") %>%
  mutate(PlayersInvolved = str_trim(PlayersInvolved)) %>%
  filter(PlayersInvolved != "") %>%
  group_by(Team, PlayersInvolved) %>%
  summarise(TouchCount = n(), .groups = "drop")

# Assign within-team ranks (1 = most touchy)
player_ridgeplot_counts <- player_ridgeplot_counts %>%
  group_by(Team) %>%
  arrange(desc(TouchCount)) %>%
  mutate(
    TouchiestRank = row_number()  # Numeric: 1 = most touchy
  ) %>%
  ungroup()

# Calculate standard deviation of TouchCount per team
team_SD <- player_ridgeplot_counts %>%
  group_by(Team) %>%
  summarise(TouchSD = sd(TouchCount), .groups = "drop")

# Step 2: Reorder Team factor based on flatness (ascending)
player_ridgeplot_counts <- player_ridgeplot_counts %>%
  left_join(team_SD, by = "Team") %>%
  mutate(Team = fct_reorder(Team, TouchSD, .desc = FALSE))  # flattest first

# Plot ridge plot using numeric ranks
StackedByTouchiest_Plot <- ggplot(player_ridgeplot_counts, aes(x = TouchiestRank, y = fct_rev(Team), height = TouchCount, group = Team)) +
  geom_ridgeline(stat = "identity", fill = "steelblue", color = "white", alpha = 0.8, scale = 0.9) +
  scale_x_continuous(breaks = 1:50, expand = c(0.01, 0)) +
  labs(
    title = "Player Touch Involvement Ridge Plot by Team",
    subtitle = "Teams ordered front-to-back from flattest to most peaked touch distributions",
    x = "Player Touch Rank within Team (1 = Most Involved)",
    y = "Team"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major.y = element_blank(),
    axis.text.y = element_text(size = 8),
    plot.title = element_text(face = "bold")
  )

#Unstacked Ridge plot
Unstacked_RidgePlot_TouchiestPlayers <- ggplot(player_ridgeplot_counts, aes(x = TouchiestRank, y = Team, height = TouchCount, group = Team)) +
  geom_density_ridges(stat = "identity", scale = 1.5, fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Distribution of Player Touch Involvement by Team",
    subtitle = "Teams ordered from flattest to most peaked distribution",
    x = "Player Touchiness Rank (1 = Most Involved)",
    y = "Team"
  ) +
  theme_minimal()

############################ Player Concentration | Social Network Strength to Final Standings ############################

# Merge flatness, player touchiness, and season rank into player data
player_touchiness_rank <- player_ridgeplot_counts %>%
  mutate(Team = as.character(Team)) %>%
  left_join(
    FinalStandings %>%
      mutate(TeamID = as.character(TeamID)) %>%
      select(TeamID, Rank),
    by = c("Team" = "TeamID")
  )

player_touchiness_rank <- player_touchiness_rank %>%
  mutate(
    Team = factor(Team),
    Team = fct_reorder(Team, Rank, .desc = FALSE)  # Rank 1 at top/front
  )

#filter to only top 18 players of touchiness... tails too long
player_touchiness_rank_top <- player_touchiness_rank %>%
  filter(TouchiestRank <= 18)

# Ridge plot (non-stacked), teams ordered by final season standings
TouchiestPlayers_Rank <- ggplot(player_touchiness_rank_top, aes(x = TouchiestRank, y = Team, height = TouchCount, group = Team)) +
  geom_density_ridges(stat = "identity", scale = 1.5, fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Distribution of Player Touch Involvement by Team",
    subtitle = "Teams ordered by final season standings (Bottom of y-axis is 1st ranked team)",
    x = "Player Touchiness Rank (1 = Most Involved)",
    y = "Team"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major.y = element_blank(),
    axis.text.y = element_text(size = 8),
    plot.title = element_text(face = "bold")
  )

#plot team flatness SD to final season rank in a simple scatter plot
#low standard deviation = flatter distribution (single players do not dominate touch interactions)
team_SD_rank <- team_SD %>%
  mutate(Team = as.character(Team)) %>%
  left_join(FinalStandings %>% select(TeamID, Rank), by = c("Team" = "TeamID"))

team_cv <- player_touchiness_rank_top %>%
  mutate(Team = as.character(Team)) %>%
  group_by(Team) %>%
  summarise(
    MeanTouches = mean(TouchCount),
    SDTouches = sd(TouchCount),
    CV = SDTouches / MeanTouches,
    .groups = "drop"
  )

team_cv_rank <- team_cv %>%
  left_join(FinalStandings %>% select(TeamID, Rank), by = c("Team" = "TeamID"))

#Look at CV (Coefficient of Variation): CV = SD / Mean -- normalized level of spread
# Plot scatter
CV_TouchConcentraion_Rank <- ggplot(team_cv_rank, aes(x = Rank, y = CV)) +
  geom_point(size = 3, color = "darkred") +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  scale_x_reverse(breaks = 1:max(team_SD_rank$Rank)) +  # Lower rank = better
  labs(
    title = "Touch Concentration vs Final Season Rank",
    subtitle = "Higher SD = touches concentrated in fewer players",
    x = "Final Season Rank (1 = Best)",
    y = "Touch Count Standard Deviation (Per Team)"
  ) +
  theme_minimal()

CV_Stats <- cor.test(team_cv_rank$CV, team_cv_rank$Rank)

######Gini Coefficient################
team_gini <- player_touchiness_rank %>%
  group_by(Team) %>%
  summarise(
    Gini = Gini(TouchCount),
    .groups = "drop"
  )
team_gini_rank <- team_gini %>%
  mutate(Team = as.character(Team)) %>% 
  left_join(FinalStandings %>% select(TeamID, Rank), by = c("Team" = "TeamID"))

GiniPlot <- ggplot(team_gini_rank, aes(x = Rank, y = Gini)) +
  geom_point(size = 3, color = "darkgreen") +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  scale_x_reverse(breaks = 1:max(team_gini_rank$Rank)) +  # Rank 1 = best, so reverse axis
  labs(
    title = "Gini Coefficient: Touch Inequality vs Final Season Rank",
    subtitle = "Higher Gini = More touches concentrated in fewer players",
    x = "Final Season Rank (1 = Best)",
    y = "Gini Coefficient of Touch Distribution"
  ) +
  theme_minimal()

GiniSum <- summary(lm(Gini ~ Rank, data = team_gini_rank))


#Percentage of touches from top 3 touchiest players
# Step 1: Sum total touches per team
team_total_touches <- player_touchiness_rank %>%
  group_by(Team) %>%
  summarise(TotalTouches = sum(TouchCount), .groups = "drop")

# Step 2: Get top 3 players per team and their touch count
top3_touches <- player_touchiness_rank %>%
  group_by(Team) %>%
  arrange(desc(TouchCount)) %>%
  slice_head(n = 3) %>%  # top 3 players
  summarise(Top3Touches = sum(TouchCount), .groups = "drop")

# Step 3: Merge and calculate proportion
touch_concentration <- top3_touches %>%
  left_join(team_total_touches, by = "Team") %>%
  mutate(Top3_Proportion = Top3Touches / TotalTouches)

# Step 4: Join with final standings to analyze relationship
touch_concentration <- touch_concentration %>%
  mutate(Team = as.character(Team)) %>%
  left_join(FinalStandings %>% select(TeamID, Rank), by = c("Team" = "TeamID"))

Concentration_top3 <- ggplot(touch_concentration, aes(x = Rank, y = Top3_Proportion)) +
  geom_point(size = 3, color = "darkblue") +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  scale_x_reverse(breaks = 1:max(touch_concentration$Rank)) +
  labs(
    title = "Touch Concentration in Top 3 Players vs Final Season Rank",
    subtitle = "Higher values = more concentrated touch behavior in fewer players",
    x = "Final Season Rank (1 = Best)",
    y = "Proportion of Team Touches by Top 3 Players"
  ) +
  theme_minimal()

Concentration_top3_Stats <- cor.test(touch_concentration$Top3_Proportion, touch_concentration$Rank)

Concentration_top3_Sum <- summary(lm(Top3_Proportion ~ Rank, data = touch_concentration))

Output for Hypothesis 4a

## 
## Call:
## lm(formula = Rank ~ Reciprocal_To_NonRecip_Ratio, data = Reciprocal_vs_Rank)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8756 -2.5819  0.0691  2.4197  4.3007 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    23.517      5.341   4.403  0.00086 ***
## Reciprocal_To_NonRecip_Ratio  -18.308      6.022  -3.040  0.01027 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.273 on 12 degrees of freedom
## Multiple R-squared:  0.4351, Adjusted R-squared:  0.388 
## F-statistic: 9.242 on 1 and 12 DF,  p-value: 0.01027

Hypothesis 4a examined whether the ratio of reciprocal to non-reciprocal touch events was associated with final season standings. The results showed a significant association: teams with higher reciprocal touch ratios tended to finish higher in the final standings, suggesting that reciprocal touch behavior may serve as a useful indicator of positive team dynamics.

The linear model showed that reciprocal touch ratio was significantly associated with final season standings (β = -18.31, p = 0.01, R² = 0.44). Teams with higher reciprocal-to-non-reciprocal touch ratios tended to finish higher in the standings, supporting the hypothesis that reciprocal interactions may reflect stronger team cohesion.

TouchiestPlayers_Rank

CV_TouchConcentraion_Rank

CV_Stats
## 
##  Pearson's product-moment correlation
## 
## data:  team_cv_rank$CV and team_cv_rank$Rank
## t = -0.88655, df = 12, p-value = 0.3927
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6880078  0.3254580
## sample estimates:
##        cor 
## -0.2479353
Concentration_top3

Concentration_top3_Stats
## 
##  Pearson's product-moment correlation
## 
## data:  touch_concentration$Top3_Proportion and touch_concentration$Rank
## t = -1.4207, df = 12, p-value = 0.1809
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7575206  0.1892233
## sample estimates:
##        cor 
## -0.3794525

Hypothesis 4b evaluated whether a more even distribution of touch behaviors across players was linked to team success. Measures of touch concentration (Coefficient of Variation, Gini coefficient, and proportion of touches by top 3 players) were not significantly associated with final rank. This suggests that while reciprocal interactions may be indicative of team cohesion, overall equality of touch distribution across players does not appear to predict season outcomes.

Inter-Rater Reliability: ICC

To assess inter-rater reliability, we computed an Intraclass Correlation Coefficient (ICC), using a two-way random effects model (agreement, single rater). The ICC assesses consistency in the number of prosocial touches recorded by different raters across the same set of 12 matches. The resulting ICC value provides an estimate of the reliability of the coding protocol for touch frequency.

The observed ICC of 0.79 indicates that raters showed good consistency in coding prosocial touch frequency, providing confidence in the reliability of the dataset used for further analyses.

#Inter-Rater Analysis

#Goal is to determine how "similar" raters were
#We have a total of 12 matches that all raters watched

library(tidyverse)
library(data.table)
library(broom)
library(janitor)
library(readxl)
library(rmarkdown)
library(readr)
library(dplyr)
library(irr)
library(fuzzyjoin)
library(knitr)


source("Data_Management.R") #Runs and brings in Matches_final from Data_Management.R script

#Dataframe "Touches_interrater" contains all the touches recorded for this analysis

# Should I exclude
# Situations excluded from core analysis for these hypotheses: Goals For/Against, Substitutions
Exclude_Touch <- c("TA", "CO", "NEG")
Exclude_Situation <- c("GF", "GA", "SUB")

Interrater <- Touches_interrater %>%  #duplicate and filter
  filter(!(HapticRitual %in% Exclude_Touch)) %>%
  filter(!(Situation %in% Exclude_Situation))

############################ Simple Frequency Check ############################

#Count check per match (simply how many each rater saw)

touch_counts <- Interrater %>%
  group_by(SeasonMatchNumber, Rater) %>%
  summarise(TouchCount = n(), .groups = "drop")

# Plot frequency per match per rater
interrater_plot <- ggplot(touch_counts, aes(x = factor(SeasonMatchNumber), y = TouchCount, fill = Rater)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Touch Count per Match by Rater",
       x = "Season Match Number",
       y = "Touch Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

###ICC - Intraclass Correlation Coefficient - Interval data:

# reshape your data: rows = matches, columns = raters
icc_data <- Interrater %>%
  group_by(SeasonMatchNumber, Rater) %>%
  summarise(Count = n(), .groups = "drop") %>%
  pivot_wider(names_from = Rater, values_from = Count)

# Apply ICC (two-way random effects model)
icc_result <- icc(icc_data[,-1], model = "twoway", type = "agreement", unit = "single")

print(icc_result)
##  Single Score Intraclass Correlation
## 
##    Model: twoway 
##    Type : agreement 
## 
##    Subjects = 12 
##      Raters = 3 
##    ICC(A,1) = 0.788
## 
##  F-Test, H0: r0 = 0 ; H1: r0 > 0 
##  F(11,23.9) = 11.9 , p = 3.33e-07 
## 
##  95%-Confidence Interval for ICC Population Values:
##   0.551 < ICC < 0.926
############################ Simple Reciprocity Check ############################

# Updated exclusion list
Exclude_Situation_IT <- c("GF", "GA", "SUB", "IT")

Reciprocal_IR <- Touches_interrater %>%
  filter(!(HapticRitual %in% Exclude_Touch)) %>%
  filter(!(Situation %in% Exclude_Situation_IT)) %>%
  mutate(
    Reciprocity = str_trim(Reciprocal),
    ReciprocityType = case_when(
      Reciprocity %in% c("Y", "G") ~ "Reciprocal",
      Reciprocity == "N" ~ "NonReciprocal",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(!is.na(ReciprocityType))

#Count how many touches each rater coded as reciprocal vs non-reciprocal 
Recip_Counts <- Reciprocal_IR %>%
  group_by(SeasonMatchNumber, Rater, ReciprocityType) %>%
  summarise(Count = n(), .groups = "drop") %>%
  pivot_wider(names_from = ReciprocityType, values_from = Count, values_fill = 0)


#Group by match and rater
Recip_Counts <- Reciprocal_IR %>%
  group_by(SeasonMatchNumber, Rater, ReciprocityType) %>%
  summarise(Count = n(), .groups = "drop") %>%
  pivot_wider(names_from = ReciprocityType, values_from = Count, values_fill = 0)

icc_data_recip <- Recip_Counts %>%
  select(SeasonMatchNumber, Rater, Reciprocal) %>%
  pivot_wider(names_from = Rater, values_from = Reciprocal)

icc_result_recip <- icc(icc_data_recip[,-1], model = "twoway", type = "agreement", unit = "single")

print(icc_result_recip)
##  Single Score Intraclass Correlation
## 
##    Model: twoway 
##    Type : agreement 
## 
##    Subjects = 12 
##      Raters = 3 
##    ICC(A,1) = 0.765
## 
##  F-Test, H0: r0 = 0 ; H1: r0 > 0 
##  F(11,10.8) = 16.2 , p = 3.4e-05 
## 
##  95%-Confidence Interval for ICC Population Values:
##   0.437 < ICC < 0.922
icc_data_nonrecip <- Recip_Counts %>%
  select(SeasonMatchNumber, Rater, NonReciprocal) %>%
  pivot_wider(names_from = Rater, values_from = NonReciprocal)

icc_result_nonrecip <- icc(icc_data_nonrecip[,-1], model = "twoway", type = "agreement", unit = "single")

print(icc_result_nonrecip)
##  Single Score Intraclass Correlation
## 
##    Model: twoway 
##    Type : agreement 
## 
##    Subjects = 12 
##      Raters = 3 
##    ICC(A,1) = 0.593
## 
##  F-Test, H0: r0 = 0 ; H1: r0 > 0 
##  F(11,23.8) = 5.6 , p = 0.000217 
## 
##  95%-Confidence Interval for ICC Population Values:
##   0.268 < ICC < 0.841
#Make some charts

# Filter for reciprocal touches
recip_counts_plot <- Reciprocal_IR %>%
  filter(ReciprocityType == "Reciprocal") %>%
  group_by(SeasonMatchNumber, Rater) %>%
  summarise(Count = n(), .groups = "drop")

# Plot
recip_plot <- ggplot(recip_counts_plot, aes(x = factor(SeasonMatchNumber), y = Count, fill = Rater)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Reciprocal Touch Count per Match by Rater",
    x = "Season Match Number",
    y = "Reciprocal Touch Count"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Filter for non-reciprocal touches
nonrecip_counts_plot <- Reciprocal_IR %>%
  filter(ReciprocityType == "NonReciprocal") %>%
  group_by(SeasonMatchNumber, Rater) %>%
  summarise(Count = n(), .groups = "drop")

# Plot
Nonrecip_plot <- ggplot(nonrecip_counts_plot, aes(x = factor(SeasonMatchNumber), y = Count, fill = Rater)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Non-Reciprocal Touch Count per Match by Rater",
    x = "Season Match Number",
    y = "Non-Reciprocal Touch Count"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


##########Global
# Total: Keep everything except TA, CO, NEG (but DO allow "IT")
Interrater_Total <- Touches_interrater %>%
  filter(!(HapticRitual %in% Exclude_Touch)) %>%
  filter(!(Situation %in% Exclude_Situation)) %>%
  group_by(Rater) %>%
  summarise(Count = n(), .groups = "drop") %>%
  mutate(TouchType = "Total")

# Recip / NonRecip: Apply stricter filter (exclude IT)
Interrater_RecipFiltered <- Touches_interrater %>%
  filter(!(HapticRitual %in% Exclude_Touch)) %>%
  filter(!(Situation %in% Exclude_Situation_IT)) %>%
  mutate(
    Reciprocity = str_trim(Reciprocal),
    TouchType = case_when(
      Reciprocity %in% c("Y", "G") ~ "Reciprocal",
      Reciprocity == "N" ~ "NonReciprocal",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(!is.na(TouchType))

Interrater_Specific <- Interrater_RecipFiltered %>%
  group_by(Rater, TouchType) %>%
  summarise(Count = n(), .groups = "drop")

# Combine all
Global_Touch_Summary <- bind_rows(Interrater_Total, Interrater_Specific) %>%
  mutate(TouchType = factor(TouchType, levels = c("Total", "Reciprocal", "NonReciprocal")))

global_icc_plot <- ggplot(Global_Touch_Summary, aes(x = TouchType, y = Count, fill = Rater)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Global Comparison of Touch Counts by Rater (Filtered Reciprocities Only)",
    x = "Touch Type",
    y = "Total Count",
    fill = "Rater"
  ) +
  theme_minimal()

##ICC Table

# Build total count per match/rater
icc_data_total <- Touches_interrater %>%
  filter(!(HapticRitual %in% Exclude_Touch)) %>%
  filter(!(Situation %in% Exclude_Situation)) %>%  # NOTE: only GF, GA, SUB excluded (not IT)
  group_by(SeasonMatchNumber, Rater) %>%
  summarise(Count = n(), .groups = "drop") %>%
  pivot_wider(names_from = Rater, values_from = Count)

# Already done earlier:
icc_result_total <- icc(icc_data_total[,-1], model = "twoway", type = "agreement", unit = "single")
icc_result_recip <- icc(icc_data_recip[,-1], model = "twoway", type = "agreement", unit = "single")
icc_result_nonrecip <- icc(icc_data_nonrecip[,-1], model = "twoway", type = "agreement", unit = "single")

# Extract results into a data frame
icc_summary <- tibble(
  TouchType = c("Total", "Reciprocal", "NonReciprocal"),
  ICC = c(icc_result_total$value, icc_result_recip$value, icc_result_nonrecip$value),
  Lower_CI = c(icc_result_total$lbound, icc_result_recip$lbound, icc_result_nonrecip$lbound),
  Upper_CI = c(icc_result_total$ubound, icc_result_recip$ubound, icc_result_nonrecip$ubound)
)

library(knitr)
library(kableExtra)

icc_summary %>%
  mutate(across(where(is.numeric), round, 3)) %>%
  kable(caption = "ICC Values by Touch Type", align = "c") %>%
  kable_styling(full_width = FALSE)
ICC Values by Touch Type
TouchType ICC Lower_CI Upper_CI
Total 0.788 0.551 0.926
Reciprocal 0.765 0.437 0.922
NonReciprocal 0.593 0.268 0.841

##  Single Score Intraclass Correlation
## 
##    Model: twoway 
##    Type : agreement 
## 
##    Subjects = 12 
##      Raters = 3 
##    ICC(A,1) = 0.788
## 
##  F-Test, H0: r0 = 0 ; H1: r0 > 0 
##  F(11,23.9) = 11.9 , p = 3.33e-07 
## 
##  95%-Confidence Interval for ICC Population Values:
##   0.551 < ICC < 0.926
##  Single Score Intraclass Correlation
## 
##    Model: twoway 
##    Type : agreement 
## 
##    Subjects = 12 
##      Raters = 3 
##    ICC(A,1) = 0.765
## 
##  F-Test, H0: r0 = 0 ; H1: r0 > 0 
##  F(11,10.8) = 16.2 , p = 3.4e-05 
## 
##  95%-Confidence Interval for ICC Population Values:
##   0.437 < ICC < 0.922

##  Single Score Intraclass Correlation
## 
##    Model: twoway 
##    Type : agreement 
## 
##    Subjects = 12 
##      Raters = 3 
##    ICC(A,1) = 0.593
## 
##  F-Test, H0: r0 = 0 ; H1: r0 > 0 
##  F(11,23.8) = 5.6 , p = 0.000217 
## 
##  95%-Confidence Interval for ICC Population Values:
##   0.268 < ICC < 0.841
ICC Values by Touch Type
TouchType ICC Lower_CI Upper_CI
Total 0.788 0.551 0.926
Reciprocal 0.765 0.437 0.922
NonReciprocal 0.593 0.268 0.841

Data/Sample Sizes

Quick Summary of Data/Sample Sizes:

Note the filtered touch count is what was mostly used in the above analysis. The cut from about 14000 touch instances to 7000 instances is due to excluding substitutes, goal celebrations, and situations such as collisions which were not defined as pro-social touches.

# Data Summary and Overview
# Gives Simple overall counts, tables, data make-up etc

library(tidyverse)
library(data.table)
library(broom)
library(janitor)
library(readxl)
library(rmarkdown)
library(readr)
library (dplyr)


############################ Complete data summary | No filtering ############################ 

#Create Tables that summarizes complete data of "Touches_final" and "Matches_final"

Touches_Summary <- Touches_final
Matches_Summary <- Matches_final

Total_Touch_Instance_count <- nrow(Touches_Summary) #count total number of touches recorded

Total_Match_Count <- n_distinct(Touches_Summary$SeasonMatchNumber) #total number of matches watched

Total_Teams <- n_distinct(Touches_Summary$Team) #teams recorded

Total_Matches_perTeam <- 26 #Matches each team played (verified below)

Summary_Table_A <- tibble(
  Variable = c("Total_Touch_Instance_count", "Total_Match_Count", "Total_Teams", "Total_Matches_perTeam"),
  Value = c(
    nrow(Touches_Summary),
    n_distinct(Touches_Summary$SeasonMatchNumber),
    n_distinct(Touches_Summary$Team),
    26  # Manually verified
  )
)

Matches_Summary <- Matches_Summary %>%
  mutate(TeamName = case_when(
    TeamName %in% c("Racing Louisville", "Racing louisville FC", "Louisville Racing") ~ "Racing Louisville FC",
    TeamName %in% c("NC Courage", "Carolina Courage") ~ "North Carolina Courage",
    TeamName %in% c("KC Current") ~ "Kansas City Current",
    TeamName %in% c("Chicago Redstar FC") ~ "Chicago Red Stars",
    TeamName %in% c("Portland Thorns") ~ "Portland Thorns FC",
    TeamName %in% c("Gotham FC") ~ "NJ/NY Gotham FC",
    TeamName %in% c("San Diego Wave") ~ "San Diego Wave FC",
    TeamName %in% c("Seattle Reign") ~ "Seattle Reign FC",
    TRUE ~ TeamName
  ))

TeamMatchCounts <- Matches_Summary %>%
  count(TeamName, name = "MatchesPlayed")

TeamMatchCounts <- TeamMatchCounts %>%
  left_join(Team_IDs, 
            by = c("TeamName" = "Team Name 2024 Season"))

ByTeam_Total_Touch_Instance_count <- Touches_Summary %>% 
  count(Team, name = "Total Season Touches") %>% 
  rename(TeamID = Team)

TeamMatchCounts <- TeamMatchCounts %>%
  left_join(ByTeam_Total_Touch_Instance_count, 
            by = c("TeamID" = "TeamID"))

############################ Core Hyp data summary | Includes filtering ############################ 

Filtered_Touch_Summary <- Touches_CoreHyp

Filtered_Touch_Instance_count <- nrow(Filtered_Touch_Summary)

#Count of frequency of touches per team
Filtered_Touches_by_team <- Touches_by_team

Filtered_TeamMatchCounts <- TeamMatchCounts %>%
  left_join(Filtered_Touches_by_team, 
            by = c("TeamID" = "Team"))

Summary_Table_B <- tibble(
  Variable = c("Total_Touch_Instance_count", "Filtered_Touch_Instance_count", "Total_Match_Count", "Total_Teams", "Total_Matches_perTeam"),
  Value = c(
    Total_Touch_Instance_count,
    Filtered_Touch_Instance_count,
    Total_Match_Count,
    Total_Teams,
    Total_Matches_perTeam
  )
)

knitr::kable(
  Summary_Table_B,
  digits = 0,
  caption = "Summary of Data and Sample Sizes"
)
Summary of Data and Sample Sizes
Variable Value
Total_Touch_Instance_count 14869
Filtered_Touch_Instance_count 7955
Total_Match_Count 182
Total_Teams 14
Total_Matches_perTeam 26